add dcgm-exporter daemonset for nvidia #81

movence · 2024-02-06T15:25:12Z

Description of changes:

Add a new dcgm-exporter daemonset to support NVIDIA GPU metrics on Kubernetes.
Add a new service backed by the dcgm-exporter pods, with traffic limited to the local node (no cross-node communication).
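
For illustration, node-local traffic can be expressed with the Kubernetes Service field `internalTrafficPolicy: Local`. This is a minimal sketch, not the manifest from this PR: the service name, namespace, selector label, and port are assumptions (9400 is the dcgm-exporter default metrics port).

```yaml
# Hypothetical sketch; name, namespace, labels, and port are assumptions, not taken from this PR.
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-service
  namespace: amazon-cloudwatch
spec:
  selector:
    k8s-app: dcgm-exporter
  # Route in-cluster traffic only to endpoints on the node where the request originated.
  internalTrafficPolicy: Local
  ports:
    - name: metrics
      port: 9400
      targetPort: 9400
      protocol: TCP
```

With this policy, a client on a node with no local dcgm-exporter endpoint gets no backend rather than being routed to another node.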

The new dcgm-exporter pod has a nodeAffinity config so that it is scheduled only on GPU nodes. The supported GPU instance types are hard-coded in values.yaml, and the list can be extended as more GPU instance types are supported.
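
A rough sketch of what such an affinity rule could look like once rendered into the pod spec. The use of the well-known `node.kubernetes.io/instance-type` label and the two instance types shown are illustrative assumptions; the actual key and list live in this chart's values.yaml.

```yaml
# Hypothetical pod-spec fragment; the label key and instance types are illustrative assumptions.
affinity:
  nodeAffinity:
    # Hard requirement: the pod is only scheduled on nodes matching one of the listed types.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - g5.xlarge
                - p4d.24xlarge
```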

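Below is the list of pods in a test cluster with one GPU node and one non-GPU node. Only one dcgm-exporter pod is running, while cloudwatch-agent and fluent-bit run on both nodes.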

NAME                                                        READY   STATUS    RESTARTS   AGE
amazon-cloudwatch-observability-controller-manager-xxxxxx   1/1     Running   0          4m17s
cloudwatch-agent-xxxxxx                                     1/1     Running   0          4m16s
cloudwatch-agent-xxxxxx                                     1/1     Running   0          4m17s
dcgm-exporter-xxxxxx                                        1/1     Running   0          4m18s
fluent-bit-xxxxxx                                           1/1     Running   0          4m18s
fluent-bit-xxxxxx                                           1/1     Running   0          4m17s

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

lisguo · 2024-02-06T18:23:22Z

helm/templates/dcgm-exporter-daemonset.yaml

+        version: v1
+    spec:
+      priorityClassName: system-node-critical
+      serviceAccountName: {{ template "cloudwatch-agent.serviceAccountName" . }}


Should we create a separate service account for the dcgm exporter instead of reusing the cloudwatch agent service account? The two might need different privileges.

Yeah, great call. Updated.
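
As a sketch of the suggestion above, a dedicated account could be declared and then referenced from the daemonset. The account name and namespace here are assumptions, not the names used in the updated PR.

```yaml
# Hypothetical sketch; the account name and namespace are assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dcgm-exporter-service-acct
  namespace: amazon-cloudwatch
# In the daemonset pod spec, reference the dedicated account instead of the agent's:
#   spec:
#     serviceAccountName: dcgm-exporter-service-acct
```

Keeping the accounts separate lets any RBAC bindings for the exporter be granted independently of the agent's.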

lisguo · 2024-02-06T18:24:35Z

helm/values.yaml

@@ -35,10 +39,12 @@ containerLogs:
 manager:
  name:
  image:
-    repository: cloudwatch-agent-operator
+    # repository: cloudwatch-agent-operator


probably need to revert to cloudwatch-agent-operator

lisguo · 2024-02-06T18:24:43Z

helm/values.yaml

    tag: 1.0.2
    repositoryDomainMap:
-      public: public.ecr.aws/cloudwatch-agent
+      # public: public.ecr.aws/cloudwatch-agent


probably need to revert this

lisguo · 2024-02-06T18:26:41Z

helm/values.yaml

-    tag: 1.300031.1b317
+    # repository: cloudwatch-agent
+    # tag: 1.300031.1b317
+    repository: tupperware


Are there changes to the agent that we need to make to get this to work?

Nope. Removed the test repo.

lisguo · 2024-02-06T18:26:49Z

helm/values.yaml

@@ -137,6 +146,9 @@ agent:
  config: # optional config that can be provided to override the defaultConfig
  defaultConfig:
    {
+      "agent": {
+        "debug": true


lisguo · 2024-02-06T18:27:15Z

helm/values.yaml

+        "metrics_collected": {
+          "kubernetes": {
+            "gpu_metrics": true


Instead of setting this to true, can we make it an empty map in case we want to add more config options in the future?

Removed. It was a leftover from the previous revision, where there was another agent daemonset specifically for GPU nodes.

lisguo · 2024-02-06T18:27:31Z

helm/values.yaml

+  name:
+  image:
+    repository: nvcr.io/nvidia/k8s/dcgm-exporter


I am assuming this will be our mirrored repo eventually

add dcgm daemonset for nvidia

511a3ac

movence requested a review from sky333999 February 6, 2024 15:25

lisguo reviewed Feb 6, 2024

movence closed this Feb 6, 2024

movence mentioned this pull request Feb 13, 2024

add dcgm daemonset for nvidia #89

Merged

mitali-salvi deleted the ci-nvidia-gpu branch July 26, 2024 20:03
